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Abstract 

We present a new implementation of the numerical integration of the classical, 
gravitational, A^-body problem based on a high order Hermite's integration 
scheme with block time steps, with a direct evaluation of the particle-particle 
forces. The main innovation of this code (called HiGPUs) is its full paral- 
lelization, exploiting both OpenMP and MPI in the use of the multicore 
Central Processing Units as well as either Compute Unified Device Archi- 
tecture (CUDA) or OpenCL for the hosted Graphic Processing Units. We 
tested both performance and accuracy of the code using up to 256 CPUs 
in the supercomputer IBM iDataPlex DX360M3 Linux Infiniband Cluster 
provided by the Italian supercomputing consortium CINECA, for values of 
N < 8 millions. We were able to follow the evolution of a system of 8 million 
bodies for few crossing times, task previously unreached by direct summation 
codes. 
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1. Introduction 

The study of the motion of A^ point-like masses having initial positions 
and velocities r^o, v^o, ^ = 1, 2, . . . , A^ interacting through a pair- wise force 
that depends only on their positions is known as the N-body problem. Its 
applications can be found on both small and large scales starting from nuclear 
physics up to astrophysical problems. In this latter case, the interaction 
force is gravity, and, when adopting newtonian gravitational interaction, the 
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problem is referred as the classical, gravitational, newtonian N-body problem. 
Building an appropriate mathematical model for this problem is a crucial task 
for astrophysicists who want to give an exhaustive representation of objects 
from planetary systems to galaxy clusters. Actually, an explicit mathematical 
solution by series, as provided by [1], exists (upon some conditions) for an 
arbitrary but it requires such an enormous amount of terms to approach 
convergence that deprives it of any pratical usability. As a matter of fact, 
the gravitational A-body problem is explicitly solvable only for A < 3 while, 
for A > 3, the procedure to get a solution is exclusively numerical. 

The first numerical simulation of a self gravitating A-body system was 
carried out by Holmberg [2]. Accounting for the similar inverse square law 
scaling with distance of newtonian gravity and the light intensity coming from 
a light source, he evaluated the gravitational interaction between two model 
galaxies represented as two groups of 37 lamps. This required entire weeks to 
accomplish this 74-body simulation. The first numerical simulations of the 
movement of an A-body system were performed by von Hoerner [3] . Faster 
numerical integrations were not possible until the years 1960s, when the first 
digital computers were introduced and Aarseth carried out simulations up to 
250 stars [4]. From those times, Aarseth has been doing a deep work on high 
precision simulations of A-body systems, producing a series of codes (from 
NBODYO up to the last NB0DY7 release [5]); a good general reference to 
direct summation codes is Aarseth's book [6]. The well known (at least in 
the field of astrophysics) GrAvityPipE GRAPE project by Sugimoto, Hut 
and Makino around the end of years 80's (see http://www.astrogrape. 
org/) constituted a significant improvement for the A-body simulations. The 
GRAPE boards constitute actual 'gravity accelerators' attached to a host 
workstation and they are still in use (GRAPE-6) allowing high performances 
for a relatively low cost. Nevertheless, over the last 10 years, the Graphics 
Processing Units (GPUs) are replacing Central Processing Units (CPUs) and 
GRAPE boards. Actually, the purpose of the GPUs, until a few years ago, 
was to manage graphics with high levels of data parallelism. Actually, on a 
screen, many pixels must be updated at the same time and each information 
is completely independent from each other. A newer GPU concept was born 
mostly thanks to GUDA (Compute Unified Device Architecture), which was 
developed by Nvidia (2006) and can be considered an extension of some 
standard high level programming languages like C, C++ and Fortran. It was 
the begin of a new movement called General Purpose computing on GPUs 
(GPGPUs, see also http : //gpgpu. org/) which is in continuous development 
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and growth because graphic cards are, in general, easy to program and very 
cheap, exhibiting a huge computational power both in single and double 
precision precision floating point operations keeping power consumption low. 
A general review on the hardware and software developments since the 1960s 
that led to the successful application of Graphics Processing Units (GPUs) for 
astronomical simulations is given by [7]. This paper is organized as follows: 
in Sect. 2 we present the main aspects of the classic, gravitational A^-body 
problem; Sect. 3 contains the description of our new A-body Hermite's 
integrator (HiGPUs); in Sect. 4 we illustrate the hardware resources we 
used for benchmarking the code, whose results are extensively presented and 
discussed in Sect. 5. Conclusions are drawn in Sect. 6. In the following 
we will use the words body^ star and particle indifferently because we will 
consider objects belonging to an A-body system always as point-like masses. 



2. The numerical solution of the AT-body problem 

The numerical solution of the A-body problem is a diflBcult task, because 
of the so called double- divergence of the two-body interaction potential. Ac- 
tually, the newtonian interaction potential between a point of mass rrii and 
another of mass mj is given by 

GmiTrij GmiTrij 

~ Ir _ r I ~ r ~ 

where r^ and Yj are the position vectors of the z-th and the j-th star, 
G is the gravitational constant and r^j = |rj — r^| represents the euclidean 
distance between the two particles. As ultraviolet divergence we mean the 
singularity in the Uij potential for very close encounters (r^j 0); the infra- 
red divergence corresponds to a never vanishing pair-wise interaction. This 
double divergence leads to two immediate consequences: 

1. close encounters (r^j 0) yield to an unbound force between inter- 
acting stars {Fij oc) producing an unbound error in the relative 
acceleration; 

2. the resultant force acting on every particle of an A-body system re- 
quires summation over A — 1 pair- wise contributions, yielding to an 
O(A^) computational complexity, which can be overwhelming when- 
ever, as in the relevant astrophysical cases, A is very large (for instance, 
A c^i 10^^ for a typical galaxy). 
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Moreover, the evaluation of r^j is computationally heavy for it requires the 
evaluation of the irrational function square root which needs more than one 
floating point operation (less than ten, with most modern compilators) . 

The UV divergence is often faced introducing a softening parameter^ e, in 
the interaction potential which becomes 



that corresponds to substitute point masses with Plummer spheres of scale 
length e [8]. In this way, close encounters are smoothed but, of course, this 
is payed by a loss of resolution at scales of the order of e and below. On 
the other hand, to reduce the 0{N'^) complexity various strategies are possi- 
ble (for example mean-field methods). We do not enter here in a discussion 
on advantages, disadvantages and peculiarities of the many diflFerent ways 
suggested to reduce the computational complexity, we just state what al- 
most unanimously accepted: any of them induce some source of error, which 
can be systematic and not easy to be controlled. While we address to, e.g., 
[9], [10] for a more extended discussion of this topic, and to [11] for a thorough 
presentation of the field of application of A^-body simulation at scales of 
of order one million, we stress here that only the direct summation over the 
whole set of interacting bodies avoids the above mentioned errors but, obvi- 
ously, it is high computationally demanding. Clearly, the direct approach to 
the A^-body problem is such to take huge advantage by architecture where ef- 
ficient computing accelerator are coupled to general purpose processors, like 
in the hybrid CPU+GPU platform. Some authors have already approached 
this topic, expecially in the frame of the nVIDIA CUDA programming en- 
vironment; in this context we cite, for instance, Nyland et al. [12], Gaburov 
et al. [13]. In particular, a double parallelization (exploiting both OpenMP 
over the hosting multicore CPUs and CUDA over the hosted CPUs) has been 
implemented for a 6th order symplectic A^-body code by Capuzzo-Dolcetta 
et al. [14]. 

In this paper we choose to couple the power of CPUs as computational 
accelerators, exploited to evaluate pairwise forces, to the precision guaranteed 
by a high order (6th) Hermite integrator, as described [15], implemented with 
hierarchically blocked time steps [6] as a good compromise between accuracy 
and speed of calculation, as we show in the following. 



GmiTrij 
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3. The AT-body code 



The main purpose of this paper is to introduce and benchmarking a new 
fast and friendly A^-body code suitable for studying the dynamical evolution 
of stellar systems composed up to 10 millions of stars with the precision 
guaranteed by direct summation of the pair wise forces. 
The code, named HiGPUs, is freely available at astrowww.phys .uniromal . 
it/dolcetta/HPCcodes/HiGPUs .html. HiGPUs is, also, part of the AMUSE 
project ([16], http://amusecode.org/). HiGPUs is a direct summation A^- 
body code mainly aimed at applications in the astrophysical context but 
easily adaptable to use in different phyisics fields. It implements the Her- 
mite's 6th order time integration scheme [15] with block time steps, allowing 
both high precision and speed in the study of the dynamical evolution of 
star systems. The code is written combining tools of C and C++ program- 
ming languages and it is parallelized using Message Passing Interface (MPI, 
[17]), OpenMP [18], and Compute Unified Device Architecture (CUBA, [19], 
http : //developer . nvidia . com/category/zone/cuda-zone) developed by 
nVIDIA corporation to allow an easy utilization of nVIDIA CPUs. The 
use of these tools allows our code to exploit in full the most modern hybrid 
computer platforms which are usually composed by some standard CPUs 
connected to CPUs acting as high performance computing accelerators. 

We have also developed another version of the code based on Open 
Computing Language (OpenCL, [20] , http : / /http : / / developer . amd . com/ 
zones/OpenCLZone/Pages/def ault . aspx) which is an open standard for 
parallel programming of heterogeneous systems, allowing the use of CPUs 
of diflFerent firms. This latter version has been successfully tested on differ- 
ent AMD CPUs (Radeon series 6xxx and 7xxx) on our private workstations 
at the Dep. of Physics, Sapienza University of Roma, exhibiting performance 
comparable or even higher than that obtained with nVIDIA Tesla C2050. 

The coarse-grained parallelization is such that a one-to-one correspon- 
dence between MPI processes and computational nodes is established and 
each MPI process manages all the CPUs available per node. The task de- 
composition on each node is done taking into account the computational 
complexity of each section of the Hermite's 6th order scheme, which con- 
sists of a predictor^ an evaluation^ and a corrector step. In the following we 
will indicate with dots derivatives respect to time. Let us consider a system 
composed by N stars, and let us assume that the i-th particle, has, at time 
tc,05 a position r^^o, a velocity v^^o, an acceleration a^^o, a jerk a^^o, a snap 
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a^^o, a crackle and an individual time step At^^o- Calling m the number 
of particles belonging to the same time-block, which have to be evolved to 
the same time tc,o + ^U,o^ the generic Hermite's step is composed by various 
substeps: 

1. Prediction step, with 0{N) complexity: positions, velocities and accel- 
erations of all the stars are predicted using their known values: 



^i.pred = r^,o + Vi,oAt^,o + ^a^,oAt • q + ^a^,oAt • q + (3) 
+ ^a,,oAtJo + ^a,,oAt5o, (4) 
^i.pred = v^^o + ai,oAti,o + Ja^,oAt • q + ^a^^oAt • q + (5) 



1 ... . ,4 



+ -a,,oA^-o, (6) 

^i,pred = ^i,0 + ^i^O^U^O + ^'^i^O^t^Q + - a^^oAt^Q' i^) 

2. Evaluation step, with 0{Nm) complexity: the accelerations of m < N 
particles as well as their first and second time derivatives are evaluated 
using the above predicted data. The mutual interaction between the 
z-th particle and the remaining — 1 is described by the following 
relations: 
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6ttaij- 1 - 3Pija.ij^-i^ , (10) 



where r^j — ^j^pred ^i^pred) ^ij — ^j^pred ^i^pred) ^ij — ^j^pred ^i^pred) 
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3. Correction step with complexity 0{m)\ positions and velocities of the 
mentioned m particles to be updated are corrected using the above 
evaluated accelerations and their time derivatives: 



^i.coTT = v^^oH ^(a^^i+a^^o) ^ (a^,! - a^^o) + (H) 

+ ^(a,i + a,o), (12) 

^i,corr — ^i,0 H - — ['^i,corr "T v^^'l ~ ^^'0/ V ^ 

+ ^(a,i + a,o). (14) 

The individual time steps for m particles are, thus, updated, by mean 
of the so called generalized Aarseth criterion [15] 



/p-3 



^(1) 



where 



A^'^ = ^|a(--i)| |a(^+i)| + |aW|^ (16) 

In Eqs. 15 and 16, p is the order of the integration method and a^^^ is 
the 5-th time derivative of the acceleration. The value of the parameter 
T] is linked to the accuracy required for the simulation; we, practically, 
found that an optimal value for r] for our Hermite's 6th order scheme is 
around 0.6. In any case, the block time steps scheme forces the values 
of Ati^i to be quantized as powers of two for the sake of an easier 
control of both the synchronization of all the stars and an efficient 
code parallelization. In the practical applications of our code, we set 
as minimum time step Atmin = 2~^^ 3.0 • 10~^, while we fixed as 
maximum Atj^ax = = 0.125 (both are expressed in program's time 
unit which coincides with the system crossing time, as defined in Section 
5). 

We note that the evaluation step is the most expensive in terms of number of 
operations needed to execute; the predictor step comes after. For this reason 
we implement both these substeps on the CPUs leaving the less demanding 



7 



corrector step to the CPUs. In more detail, if n is the number of CPUs used 
for a simulation, each GPU deals with the predictor of N/n particles and 
evaluates 3m{N/n) accelerations and their first and second order derivatives, 
collected and reduced from the all set of computational nodes by means of 
the MPI_Allreduce() functions [17] 

The implementation of the evaluation step on a GPU as efficiently as pos- 
sible requires some consideration on the structure of the GPU. For instance, a 
GPU nVIDIA Tesla C2050 is capable to run up to 21, 504 threads in parallel, 
this number corresponding to the combination of 14 Multiprocessors (MPs) 
which can run up to 1536 resident threads, each. To exploit the full power 
of such a GPU is, therefore, necessary to run 21,504 threads in parallel. If 
the stars to be updated are m and the GPUs to use are n, the simplest, but 
not most efficient, parallelization scheme of the forces calculation would be 
such to run m threads per GPU and calculate the partial accelerations (and 
their derivatives) due to N/n bodies. This is simple but if m is small (less 
than 21, 504) there is not enough work to fully occupy the GPU, causing a 
significant degrading of performance. In this low m case, in order to increase 
the number of GPU thread blocks to map to as many MPs as possible, our 
program tries to reduce the number of threads per block to use in the com- 
putation, starting from 128 down, until this number reduces to the minimum 
possible, set to 32 [21]. Anyway, this may be not enough to guarantee a 
good load of the GPU. To cope with this we introduced a variable, which 
we call Bfactor^ acting as factor that multiplies, when necessary, the total 
number of GPU blocks of threads in order to split further the computation. 
For example, if m < 21,504 and Bfactor = B we run m ^ B threads per 
GPU and the partial accelerations (and derivatives) due to N/ [nB) stars are 
computed. Obviously, before passing the results to the CPU, each GPU has 
to reduce B blocks of accelerations. This adds a GPU reduction operation 
to our code but, as we will see, this cost is amortized by what is gained in 
the evaluation kernel. In any case, the program can recognize the GPU in 
use, calculate the minimum number of parallel threads to fully load it, and 
determine the Bfactor maximum value by using the formula 

r 1 

BmAX = (17) 

\_n * TpB _ 

where we have used TpB to indicate the number of GPU threads per 
block. The advantages of using this technique will be made clearer in Section 
5 where we compare performance (for different values of m) of the evaluation 
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step with and without the use of the Bfactor variable. 



4. Hardware 

Our new A^-body code works under standard Linux distributions and 
requires CUDA (or OpenCL), gcc compiler and OpenMPI implementations. 
The benchmarking was performed using the IBM iDataPlex DX360M3 Linux 
Infiniband Cluster available since June 2011 at the Italian supercomputing 
consortium CINECA (http://www.cineca.it/eii). It consists of 274 com- 
putational nodes which exchange data through a Qlogic QDR (40 Gb/s) 
Infiniband high-performance network. Each node has two GPUs (for a to- 
tal of 528 nVIDIA Tesla M2070 + 20 nVIDIA Tesla M2070-Quadro), two 
CPUs Intel Xeon Esa-Core Westmere E5645 running at 2.4 GHz and 48 GB 
of RAM memory. The operating system is Linux Red Hat EL 5.6 x86_64 
while the version of the gcc compiler installed and tested is the 4.4.4, the 
CUDA version is the 4.0 and the OpenMPI version is the 1.4.3. We re- 
mark that, at present, only about half of PLX is available for scientific 
users and the maximum number of computational nodes which can be ob- 
tained simultaneously is limited to 44 for 6 hours or 22 for 24 hours (https : 
//hpc . cineca. it/content/ibm-plx-user-guide). This is why some of the 
benchmarks shown in this work are partially incomplete; nevertheless, upon 
explicit request, we obtained a few computing hours to run the code on nodes 
and 128 nodes, in order to test and evaluate performance with 4M and 8M 
stars. 

5. Results 

In this Section we show the results of our new direct A^-body code, both 
in terms of accuracy and performance. We performed a set of N-body simula- 
tions with values of A^ in the range from 32k to 8M stars, spatially distributed 
according to the mass density 



where r is the distance from the centre and b and M are, respectively, the 
scale length (also called core radius) and the total mass of the system. The 
choices 6 = 1, M = 1 and, for the gravitational constant, G = 1 as physical 




(18) 
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units for the A^-body simulations, lead to the so called system crossing time 
as unit of time for the code, written as 

hi 

tc = (19) 

We used a softening parameter e = 1.0 • 10""^, which is around 50 times 
smaUer than the minimum closest neighbour average distance for N = 8M. 
To perform our benchmarks we chose values of as powers of two. This is 
not compulsory but apt to guarantee reaching best optimization. 

5.1. Energy and angular momentum conservation 

The accuracy of our code is controlled by the parameter rj (see Section 
3). To test the accuracy we run A^-body simulations with N = 2'' with k 
integer between [15; 20] over 10 time units, checking both the energy and the 
angular momentum conservation over that time interval. Specifically, the 
relative errors are calculated using the expressions 



1 ^° 
" 10 ^ 



Ek{ti)-EkiO) 



Ek{0) 



1 ^° 
^ 10 ^ 

i=l 



Lk{ti) - Lfc(O) 



(20) 

where Ek{ti) and Lk{ti) are, respectively, the total energy and absolute 
value of angular momentum of the N = 2^ system evaluated at ten times 
ti^ which are multiples of the system crossing time. Moreover, the obtained 
values of AE^ and AL^ are averaged over the five values of A^. A similar 
approach to estimate the accuracy of their direct A^-body code was followed 
by Berczik et al. [22]. In Fig. 1 we show the results obtained for different 
values off]. As expected the energy error does not depend on r] when rj is small 
enough; for r] < 0.3 the relative energy error gets an almost constant value 
around 7.0 • 10~^^. Increasing the value of rj leads to a progressively worse 
energy conservation, yielding to (AEk) — 4.0 • 10~^ for = 1.0. A similar 
trend is noticed for the angular momentum error. We chose to maintain the 
energy error for our benchmarks below 5 • 10~^ and the angular momentum 
error around 5 • 10"'' so we set rj = 0.6. 

5.2. Code scalability 

Figure 2 shows the wall clock time needed to integrate an A^-body system, 
for diflFerent values of A^, up to one unit of time as a function of the number 
of GPUs used. 
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Figure 1: Averaged relative energy (upper panel) and angular momentum (lower panel) 
errors as a function of the accuracy parameter r]. 

As relevant result, the total execution time decreases linearly increasing 
the number of GPUs whenever the number of bodies is large enough (IM, 4M 
or 8M). The departure from this inverse linear trend is seen for < 262k and 
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2 4 8 16 32 64 128 256 

Number of GPUs 

Figure 2: Time needed to integrate A^-body systems (32/c < N < SM) over one time unit 
using different numbers of nVIDIA Tesla M2070 GPUs . 

when the number GPUs used is greater than 16. This is expected because, 
when the number of particles per GPU is smaU (< 1000), the computational 
load is not enough to exploit the full computational power of the GPUs thus 
to cover adequately memory latencies, MPI communications, and other non- 
scalable parts of our code. Another important output of Fig. 2 is that, using 
256 GPUs, an integration of a 8M-body system over one time unit is done in 
less than 10 hours which is, to our knowledge, an unprecedented performance 
for such kind of high precision, direct summation, A^-body simulations. 

5.3. Speedup and Efficiency 

A deeper analysis of the performance of our code may be done by mean 
of the use of parameters like the speedup (Sn) and the efficiency (En). The 
speedup quantifies how faster a parallel algorithm is respect to the corre- 
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2 4 8 16 32 64 128 256 

Number of GPUs 

Figure 3: The speedup of our code as function of the number of GPUs used. The straight 
dashed hue represents the trend of the perfect speedup. 

spending sequential one, and it is defined as: 

S. ^ f|, (21) 

where AT^ is the time spent to execute the program using n computational 
units (GPUs, in our case). A parallel algorithm is considered to be 'perfectly' 
written if the so called linear speedup is reached. This ideal situation corre- 
sponds to Sn = n. 

The efficiency indicates how the parallel algorithm exploits the whole 
available computational resources. It is strongly linked to the speedup and 
is usually expressed as: 

En = -. (22) 

n 

Low efficiencies mean a huge amount of time spent in data communications 
and/or synchronization events, that are, indeed, real bottlenecks for almost 
all highly parallel applications. 
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Figure 4: The efficiency of our code as function of the number of GPUs used. The 
horizontal dashed hue represents the trend of the perfect efficiency. 

Figures 3 and 4 draw how our code is able to integrate up to 8M particles 
keeping a very good efficiency (c^ 0.80) when using 128 nodes, decreasing to 
^ 0.70 for 4M. Nevertheless, the smaller the number of bodies, the 

worse the scalability of our code is. This is a behaviour common to this kind 
of numerical codes, direct consequence of, at least, two different factors: 

1. when the number of particles is small there is not enough work assigned 
to the generic GPU thread to cover adequately the latencies. This is 
due i) to the too frequent GPU global memory access compared to the 
computational load, ii) to the data transfer between GPUs and CPUs, 
or, in the worst case, iii) to idle GPU cores yielding the performance 
of the generic GPU to very low levels; 

2. increasing the number of GPUs and computational nodes implies the 
necessity to exchange and reduce data through a network connection, 
operation which becomes significant in terms of time spent if the num- 
ber of nodes in use is high (high latencies and low bandwidth whose 
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speed is around 40 Gb/s for the IBM PLX we used). 

At the hght of these observations, it is clear that the highest efficiency is 
reached whenever a right balance between the number of GPUs and the size 
(in terms of A^) of the astrophysical problem is reached. 

5.4' Code profiling 

To obtain a clear picture of what the critical, non scalable, parts of our 
code are, we divided the operations and tasks performed in a single time 
step into 9 parts and we measured their execution times. The schematic 
representation of our code and its main tasks is given and explained in Table 
1. 

To investigate the performance of the individual sections of our code, we 
will focus on a system composed of about IM stars {N = 2^° to be precise) 
chosen as reference because it exhibits an average behaviour among all our 
benchmarks. Following the notation already listed in Table 1, we report in 
Fig. 5 the fractional times spent to complete different sections of our code 
as a function of the number of computational nodes used. 

It is worth noted that the force calculation section of the code becomes 
less important in terms of execution time at increasing the number of nodes, 
reducing from about 100%, when using 1 node (2 GPUs) only, to 75% with 
32 computing nodes. Contemporarily, the relative contribution of the other 
code sections increases, especially the MPI communication part which goes 
from 0.2% with 1 node to 10% when we use 32 nodes. The same happens 
for the corrector step (7.5% of the total time with 32 nodes) which, however, 
has not been yet parallelized and for the CPU-to-GPU data transfer, which 
contributes for a maximum of 4% over the total execution time. 

Figure 6 is very helpful to identify possible bottlenecks in the code. It 
shows the time spent in the various sections of the code (as indicated in Table 
1) as a function of the number of computational nodes when integrating the 
IM-body system over 1 time unit. The figure shows that the two sections 
that scale increasing the number of GPUs are the evaluation step (which is 
the most relevant part) and the predictor step. 

The trend of the dependence of the evaluation step on the number of 
nodes, n^, is well fitted by 
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Table 1: The main sections constituing our code, which are performed at each time step. 
We indicate with n the number of GPUs used in the computation. The "convenient 
adjustment" mentioned in the description of the 5th section of our code refers to the re- 
organization of the computed and reduced accelerations and derivatives in one array only 
(instead of three) to improve the performance of the subsequent data transfer from the 
GPU to the CPU. In this way we execute one bigger copy instead of three smaller ones. 
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Figure 5: Relative (to the total) execution times of the different parts of our code (labeled 
as in Table 1) integrating an = IM system up to one unit of time using various numbers 
of computational nodes. 

with best fitting values a = 33, 213. Oil. 6 s and a = -0.99968±0. 00030. This 
denotes, according to Eq. 21, a very good speedup of our main gpu kernel, 
at least for A^=1M and up to 64 GPUs. Moreover, we checked that for G 
[32k; 8M] the value of a remains around —1, i.e. we get an approximatively 
linear speedup for the evaluation step. On the other hand, the At^pi part 
of the code, as shown in Fig. 6, grows with with an almost logarithmic 
scaling as a result of a complex combination of latency effects, low network 
bandwidth and inter-node reduction operations. The time spent in this part 
of the code is fitted by the expression 

Atn^^MPI = b + C logio (24) 

with best parameters b = 85.2 ± 5.2 s and c = 49.3 ± 5.7 s. The logarithmic 
growth of this section is common to all values of and may reduce signifi- 
cantly the efficiency and scalability of our code when using a large number 
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Figure 6: The times needed to complete the evaluation step, the predictor step, the MPI 
communications and the other sections of the code grouped together in the remaining 
curve. All the times refer to an IM stars system integrated over one time unit. 

of CPUs. 

At the light of the analysis above it is clear that a faster network con- 
nection could improve performance of our code significantly. Moreover, the 
corrector step (which constitutes about 70% of the constant part in Fig. 
6) may be furtherly speeded up by its parallelization using, for instance, 
OpenMP directives. 

5.5. Consequences of block time steps and optimizations 

Our direct summation A^-body code is implemented by mean of hierar- 
chically blocked time steps. This implies that the stars to be integrated in 
a generic time step may vary from 1 to A depending on how the blocks are 
populated. As a consequence, it is interesting to see what are the group of 
particles giving the biggest contribution in terms of the total execution time, 
in order to know where our code needs further optimization. To do this, 
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we measured the update frequencies for the whole set of stars over one time 
unit (in our IM-body reference case) using 32 computational nodes and then 
we multiplied these values by the sum of the times needed to complete each 
section of the whole time step for those bodies. After grouping particles into 
6 groups (labeled with the letters A, B, C, D, E and F) we obtained the 
results sketched in Fig. 7. 
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Figure 7: Considering the IM-body reference system, calling fm the update frequency of 
m particles over 1 time unit and calling Tm,s the time needed to complete the section s 
of our code for m stars, the percentage value Hq of each bar can be obtained as Hq = 
1^ J2s X^mGG ■ where Tfot is the total time needed to complete 1 time unit and G 
indexes the groups A {me [1; 20]), B {m e [21; 100]), C {m e [101; Ik]), D {m e [Ik; 10k]), 
E{me [10/c; lOOA;]), F {m e [lOOA;; IM]). 



Figure 7 show that, when using 32 nodes for the dynamical integration of 
IM stars, the evaluation time can become significantly smaller, for example, 
than that for the per-block particles determination (~ 29% for the A group, 
brown part) or than that for the GPU reduction of the partial forces due to 
the presence of the Bf actor (^ 30% for the A group, in yellow). Moreover, 
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the forces calculation time may be comparable to that needed to complete the 
predictor step (red) and to that needed to exchange data from the CPU to the 
GPU (pink). This situation becomes more evident when the number of nodes 
increases, while it fades, as expected, with larger number of bodies; the two 
effects tend to compensate, approximately, each other. Therefore, in order to 
obtain a better performance, it is worth performing a further improvement 
of our code in the case when the number of stars to be updated is small 
compared to A^. It must be noted that these results are strongly related 
to the chosen initial conditions. Actually, we checked that the inclusion of a 
massive or super-massive object (black hole) in the N-body system influences 
the local dynamics such that the contribution of the blocks containing less 
particles becomes much more signiflcant as previously shown by [23]. 

The use of the variable which we indicate as Bfactor (see Section 5) 
can improve signiflcantly performance in these regimes. In Fig. 8 we show 
the speed achieved in the forces calculation as a function of the number 
of bodies to be updated, having 16,384 stars per GPU, in the cases when 
the Bfactor optimization is active and when it is switched off. This mimics 
the situation in which an A^-body system of 1,048,576 stars is run using 64 
GPUs (32 IBM-PLX computational nodes). As we see in Fig. 8 the discussed 
optimization helps to improve performance up to a factor 50 when the number 
of particles to be updated is less then 20. This means that, referring to Fig. 
7, the contribution to the total time of the flrst bar (A) would have been 
around 50 times larger without introducing a Bfactor value becoming the 
real bottleneck for our applications. 

5.6. GPU memory used by HiGPUs 

As flnal benchmark, we investigated the maximum GPU memory used as 
a function of the number of stars to integrate. The results are shown in Fig. 
9. As we can see, on a GPU Tesla M2070 it is possible to handle up to A^ = 
8M stars, while using a Tesla C2050 A^ is reduced to 4M. 

This is, indeed, a real limit for our code applicability. 

5. 7. Hardware maximum performance 

The nVIDIA architecture named FERMI organizes the generic GPU as a 
group of 16 Streaming Multiprocessors of 32 Streaming Processors each for 
a total of 512 cores which often are called simply cuda cores. The GPUs 
tested in this work (Tesla M2070), using our N-body code, are based on the 
FERMI architecture having 448 active cuda cores over the 512 potentially 
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Figure 8: Speed (in GFLOPS) achieved updating different numbers of stars using 64 GPUs 
for a IM-body system. The red bars indicate the improved performance when the Bfactor 
optimization is active. 



available. Each of them has a clock frequency around 1.15 GHz and can 
execute up to two single precision floating point operations (32 bit) per clock 
cycle. This means that the theoretical performance peak in single precision 
can be determined by 

Speak = 448 cores • 2 flops/cycle • 1.15 GHz = 1030.4 GFLOPS. (25) 

On the other hand, up to 32 double precision floating point operations (64 
bit) can be performed per Streaming Multiprocessor, per clock cycle. Having 
14 active multiprocessors on a Tesla M2070 we get 

Dpeak = 14 multiprocessors -32 flops/cycle- 1.15 GHz = 515.2 GFLOPS, (26) 

which is exactly half of the single precision peak. It has to be underlined 
that this compute capability in double precision operations is unprecedented 
and it constitutes a big jump for high performance computing solutions 
because, before the advent of the FERMI architecture, the biggest prob- 
lem to use GPUs for scientific fields was indeed the lack of 64 bit opera- 
tions. Our GPU main kernel, which calculates the accelerations between 
stars and their higher order derivatives, uses double precision variables to 
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Figure 9: Used on-board GPU memory, in GB, as it grows with the number of stars of a 
generic A^-body system. 

store predicted positions, accelerations and derivatives, while it computes 
partial results in single precision mode using 32 bit registers. We prefer 
to use this hybrid strategy although other authors (see for example [13] 
or http: //crd-legacy . Ibl .govZ-^dhbailey/mpdist/) sometimes emulate 
double precision operations storing one 64 bit number in two 32 bit numbers 
yielding to 14 the number of significant digits and letting the use of GPUs 
which do not natively support double precision arithmetics (rare nowadays) . 
To estimate how much our code is capable to exploit the FERMI architecture 
we measured its peak performance. To do this, we counted how many floating 
point operations are enclosed in our evaluation kernel (the most expensive 
section in terms of computational load) and then we divided this value by the 
time needed to execute it, obtaining the performance expressed in GFLOPS. 
Other authors use different strategies to count operations [24, 12] but we 
prefer to refer to Table 2. 

Table 2 has been built following the Table 5-1 of the CUD A C program- 
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Operation 


CUDA expression 


Equivalent flops 


a±b 


a ± b 


1 


a • b 


a * b 


1 


1 

y/a 


rsqrt(a) 


4 


a 
b 


a/b 


5 


a' 


pow(a,b) 


9 



Table 2: The number of floating point operations required by the operations most relevant 
for our code. 



ming guide [21] together with the information given by the whitepaper of the 
FERMI architecture [25]. Specifically, we can notice that the power eleva- 
tion operation is very expensive, and it must be avoided, as much as possible, 
because it is implemented using a combination of one base-2 logarithm, one 
base-2 power elevation and one multiplication. Following Table 2 we counted 
15 double precision operations plus 82 single precision operations in our main 
kernel. Therefore, we estimate the theoretical peak achievable, per GPU, by 
the formula 

^ _ 82 ■ Speak + 15 • D,,ak ^ GFLOPS. (27) 

peak 82+15 ^ ^ 

This is of course a pure ideal value because it does not consider any kind of 
memory latency, communication and/or read and write operations which, in 
general, can reduce performance significantly. The formula that we derived 
to count GFLOPS in our main kernel is the following 

97 • • 777 

^ ^ 109 AT fN , (GFLOPS) (28) 

where is the total number of stars that form our N-body system, m is the 
number of particles to be updated and /STker{N ^m) is the kernel execution 
time. A similar formula has been used by other authors [15, 26]. We reached 
a performance over 100 TFLOPS using 256 Tesla M2070 with = 2^3 - 
8.4 X 10^ stars, which corresponds to ^ 400 GFLOPS per GPU, that is around 
40% of the claimed peak GPU performance. This is a very good result for 
this kind of astrophysical computations, especially in this case of heavy use 
of the double precision arithmetics, at least comparable to what obtained by 
other authors with similar (in structure) A^-body codes (see [26]). 
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6. Conclusions 

Composite architectures based on computational nodes hosting multicore 
Cnetral Processing Units connected to one or more Graphic Processing Units 
have been shown by various authors to be a clever and efficient solution to 
supercomputing needings in different scientific frameworks. Actually, these 
architectures are characterized by a high ratio between performance and 
both installation cost and power consumption. A practical proof of this 
is that some of the most powerful systems in the TOP 500 list of world's 
supercomputer are based on that scheme. They are indeed a valid alternative 
to massively parallel multicore systems, where the final computational power 
comes by the use of a very large number of CPUs, although each of them has 
a relatively low clock frequency. It is quite obvious that a full exploit of the 
best performance of the CPU+GPU platforms requires codes that clearly 
enucleate a heavy computational kernel, to be assigned in parallel to the 
CPUs acting as 'number crunchers' which release, periodically, their results 
to the hosts. In physics, the study of the evolution of systems of objects 
interacting via a potential depending on their mutual distance falls into this 
category. 

In this paper we presented and discussed a new, high precision, code apt 
to simulating the time evolution of systems of point masses interacting 
with the classical, pair, newtonian force. The high precision comes from 
both the evaluation by direct summation of the pairwise force among the 
system bodies and by a proper treatment of the multiple space and time 
scales of the system, which means resorting to an individudal time-stepping 
procedure and resynchronizations, as well as using a high order (6th) time 
integrator. 

In this paper we discussed the implementation of our fully parallel ver- 
sion of a direct summation algorithm whose O(A^) computational complexity 
is dealt with by CPUs acting as computational accelerators in the hosting 
nodes where multicore CPUs are governed and linked via MPI directives. 
The code, called HermitelntegratorCPUs (HiGPUs, available to the scientific 
community on astrowww . phys . uniromal . it /dolcetta/HPCcodes/HiGPUs . 
html or in the frame of the AMUSE project on amusecode . org) shows a very 
good performance both in term of scaling and efficiency in a good compromise 
between precision (as measured by energy and angular momentum conser- 
vations) and speed. We performed an extensive set of test simulation as 
benchmarks of our code using the PLX composite cluster of the CINECA 
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Italian supercomputing inter-university consortium. We find that the inte- 
gration of an = 8, 000, 000 bodies is done with an 80% efficiency, that is a 
deviation of just 20% from the linear speedup when using 256 nVIDIA Tesla 
M2070. This corresponds to less than 10 hours of wall clock time to follow 
the evolution of the 8M body system up to one internal crossing time, per- 
formance, at our knowledge, never reached for such kind of simulations. This 
means that with HiGPUs it is possible to follow the evolution of a realistic 
model of Globular Cluster (a spherical stellar system orbiting our and other 
galaxies and composed by about 1,000,000 stars packed in a sphere of about 
10 pc radius) with a 1:1 correspondence between number of real stars in the 
system and simulating particles over a length of about 10 orbital periods 
around the galactic center in few days of simulation. These kind of simula- 
tions will allow, for instance, a thorough investigation of open astrophysical 
questions that may involve, in their answer, the role of globular clusters and 
globular cluster systems in galaxies. We cite the open problem of the origin 
of Nuclear Clusters as observed in various galaxies, like our Milky Way. Some 
authors (e.g., [27] and [28]) suggested a dissipational, gaseous origin while 
others ([29], [30]) indicate as more realistic a dissipationless origin by orbital 
decay and merging of globular clusters, as already numerically tested in [31] 
and [32]). 

One limit in the use of our code is the GPU memory: with a 6GB RAM, 
as in the case of nVIDIA Tesla M2070, the upper limit in A^ is - 8, 400, 000, 
which is, anyway, a number sufficiently large to guarantee excellent resolution 
in the simulation of most of the astrophysically interesting cases involving 
stellar systems. A the light of the results obtained in this paper, we are 
convinced that it is worth testing some other commercially available high- 
end CPUs apt to work in a scientific environment, like, for instance, the 
AMD of the HD6970 and 7970 series, which we showed (astrowww.phys . 
uniromal.it/dolcetta/HPCcodes/HiGPUs.html) to be absolutely compet- 
itive in terms of performance/cost ratio. This suggests as clever the idea of 
adopting for department sized high end computing platforms the solution of 
few tens of nodes composed by exacore CPUs connected to 2 up to 4 HD7970 
GPUs acting as computing accelerators; all this at a very reasonable cost, 
both on the purchase cost and power consumption. 
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